feat: run file hash algorithms in parallel #3636
base: main
Signed-off-by: Keith Zantow <[email protected]>
I pulled this, built it locally, and tested it with a few containers. Maybe I'm doing it wrong, but I don't see a positive difference. Is it only going to benefit certain use cases or container types? I used three different containers and ran syft v1.19.0 and v1.19.0 with this patch (v1.19.0-pfh), with increasing
Is there a better test I could do?
docker.io/nextcloud:latest
docker.io/opensearchproject/opensearch:latest
docker.io/pytorch/pytorch:latest
@popey -- hmm... It's probably worth comparing apples to apples -- testing this vs.
Ok, I'll re-run with the new update, and larger degree of parallelism. What's your definition of "small" in image terms? The ones I'm currently using are this kinda size...
For some reason, I didn't look at the image names 🤦 This change is a lot less about number of files and more about total bytes to process -- what are the sizes in GB? Uncompressed sizes are:
Are these too small? I went for something a little bigger:
docker.io/nextcloud:latest
docker.io/opensearchproject/opensearch:latest
docker.io/pytorch/pytorch:latest
docker.io/huggingface/transformers-all-latest-torch-nightly-gpu:latest
Hm, this is weird. I see the pfh PR has better times when using high parallelism, but I'm not quite sure why the v1.19.0 ones are faster than the pfh ones to start with!?
Are you still using the
v1.19.0-pfh is this PR - rebuilt a couple of hours ago, after this PR was updated. Maybe poorly named, it's just this PR.
@popey right, so it does not include all the other changes on
Maybe, I'm more looking at this from a user perspective. What will 1.20 (or whatever it's called) look like compared to 1.19?
I re-ran my tests on this PR using measure-syft. It ran against this PR and main five times each. The summary is below, and specific details from the logs are further down. Looks great!
Syft Performance Test Results
Date: 2025-02-07 15:31:21
Results
Logs snippets
Main
feat/parallelize-file-hashing
Description
This change plumbs through the parallelism config to the file hasher, so files are hashed in parallel using up to the number of threads specified in the parallelism config.
This is related to: #3266
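For readers who want a concrete picture of what "hashed in parallel using up to the number of threads specified" can look like, here is a minimal sketch in Go. It is not the code from this PR: `hashFile`, `hashAll`, the plain `int` parallelism argument, and the choice of SHA-256 are all assumptions made for illustration. A buffered channel acts as a counting semaphore so that no more than `parallelism` files are hashed at once.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"sync"
)

// hashFile streams one file through SHA-256 and returns the hex digest.
// (Hypothetical helper; the algorithm choice is an assumption.)
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// hashAll applies hashOne to every path, running at most `parallelism`
// calls at the same time. It returns the digests that succeeded and the
// first error encountered, if any. (Hypothetical helper.)
func hashAll(paths []string, parallelism int, hashOne func(string) (string, error)) (map[string]string, error) {
	if parallelism < 1 {
		parallelism = 1
	}
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
		digests  = make(map[string]string, len(paths))
		slots    = make(chan struct{}, parallelism) // counting semaphore
	)
	for _, p := range paths {
		slots <- struct{}{} // blocks while `parallelism` hashes are in flight
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			defer func() { <-slots }() // free the slot for the next file

			sum, err := hashOne(p)

			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = fmt.Errorf("hashing %s: %w", p, err)
				}
				return
			}
			digests[p] = sum
		}(p)
	}
	wg.Wait()
	return digests, firstErr
}
```

Calling `hashAll(paths, 4, hashFile)` from a `main` function would hash the given paths with up to four files in flight. As noted in the conversation, any speedup from this kind of change depends more on total bytes processed than on the number of files, so small inputs may show little difference.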